ABSTRACT 

Arguing that writers must be able to evaluate the 
quality and effectiveness of the texts they produce, this paper 
begins by isolating some of the persistent questions raised by people 
in education, business, and government who want to judge how well 
their texts are working. The paper then compares the cognitive 
processes involved in "reading to comprehend text" with those 
involved in "reading to evaluate and revise text," stressing that 
even experienced writers often need help in detecting and diagnosing 
text problems. The paper then characterizes three general classes of 
tests for evaluating text quality: (1) text-focused; (2) 
expert-judgment-focused; and (3) reader-focused approaches. The paper 
reviews typical methods within each class and discusses the relative 
advantages of reader-focused methods over other approaches. (Four 
figures are included; 150 references are attached.) (RS) 

Abstract 



To create texts that meet the needs of audiences, writers must be able to evaluate the quality 
and effectiveness of the texts they produce. Over the last sixty years, a variety of 
text-evaluation methods have been developed, and writers can now choose among many 
alternative methods. This paper begins by isolating some of the persistent questions raised 
by people in education, business, and government who want to judge how well their texts 
are working. It then compares the cognitive processes involved in "reading to comprehend 
text" with those involved in "reading to evaluate and revise text," stressing that even 
experienced writers often need help in detecting and diagnosing text problems. The paper 
then characterizes three general classes of tests for evaluating text quality: (1) text-focused, 
(2) expert-judgment-focused, and (3) reader-focused approaches. It reviews typical 
methods within each class, examining the strengths and limitations of particular tests, and 
discusses the relative advantages of reader-focused methods over other approaches. 

EVALUATING TEXT QUALITY: 
THE CONTINUUM FROM TEXT-FOCUSED 
TO READER-FOCUSED METHODS 

by 

Karen A. Schriver 
Carnegie Mellon University 



We frequently read texts by writers who fail to consider our needs as readers. 
Writers may forget to provide a necessary context, fail to include examples, obscure the 
purpose, leave out critical information, or write too abstractly. Writers of all ages from 
almost every profession share two questions: How can we anticipate and meet the reader's 
needs? How can we know if we were successful? Writers have been found to have 
genuine difficulty both in considering the reader's needs while planning and generating text 
and in judging their success during revision. Thus, it is not surprising that people in 
education, business, the health professions, and government have been looking for reliable 
ways to evaluate the quality of texts they create. 

Since the 1930s, a variety of document-evaluation methods have been developed and 
writers are now in the position to choose among alternative evaluation methods. In this 
paper, I categorize typical methods for evaluating text quality into three general classes: 
text-focused, expert-judgment-focused, and reader-focused approaches. My aim is to give 
an overview of popular methods and to identify their strengths and weaknesses within the 
context of what is known about text evaluation. 

Initially, I discuss research in reading and writing that has investigated the thinking 
processes of people as they engage in evaluating text with the goal of revising it. In particular, 
I compare the cognitive processes involved in "reading to comprehend text" and "reading to 
evaluate and revise text." This research raises the issue that an adequate theory of text 
evaluation must account for what people do as they read with the intention of judging text 
quality. This work also points out that adequate testing methods must provide writers with 
what they need most for planning or revising: an image of the intended audience interacting 
with the text. I then discuss these issues in the context of the most frequently used 
methods within each of the three classes (text-focused, expert-judgment-focused, and 
reader-focused approaches) and show why reader-focused methods have relative 
advantages over other approaches. 



QUESTIONS RAISED BY TEXT-EVALUATION RESEARCH 

Text evaluation is a difficult and tangled issue. If you asked a room of researchers or 
practitioners in the area "What are the key questions in text evaluation?" you would hear a 
wide range of issues: 

• What are the characteristics of an effective text? 
  Can we agree on a working definition of text quality? 

• What are the key skills and abilities involved in text evaluation? What do 
  experienced evaluators do that inexperienced evaluators do not? 

• What do writers learn from repeated experience in judging text quality? 
  How can we improve evaluators' abilities to judge text quality? 

• [...] associated with particular methods for judging text quality? 
  What methods produce reliable and valid judgments? 

• What aspects of text evaluation can we automate using the computer? 

• How can the computer help reduce the burden of text evaluation? 

Underlying these questions are several themes: Can we identify benchmarks for 
text quality? Can we teach evaluators to judge the quality of text consistently 
and reliably? Can we identify ways to help evaluators improve their skills in judging text? 
How can technology help us in our efforts to assess text quality? Much of the work that is 
directed toward answering these questions has been conducted by theorists and researchers 
in reading, rhetoric, composition, and document design. 

Reading researchers have been trying to understand differences between what they term 
"considerate" and "inconsiderate" text [1-5]. They have been exploring the kinds of 
structures that promote or inhibit comprehension and want to know more about what 
happens to the comprehension process when we encounter poorly written text. Such work 
sheds light on what readers do in constructing a representation of a text, whether the text 
[...]. [...] emphasize that we need more empirical work identifying 
the global and local textual relations which help readers to construct a coherent model of the 
text's information. 

Studying literacy in the workplace is also helping us to understand the demands of 
[...] how dramatically work-type reading differs from school-type reading 
[6-10]. Such research makes it clear that to meet the unique needs of readers in 
nonacademic contexts, writers need detailed information about the kinds of reading that 
[...] at work, information about the diverse purposes, goals, and strategies for 
[...]. 

Research in rhetoric, writing, and document design has been trying to identify the key 
variables which underlie skilled performance in creating rhetorically effective text. There 
are now a number of studies which aim to characterize the processes involved in planning, 
writing, and revising text for readers [11-16]. Such studies are exploring the cognitive, 
social, and cultural processes of writers as they engage in creating and evaluating text 
[...] differences in writers' abilities to judge text from the perspective of the 
audience. Both experienced and inexperienced writers have been found to have more 
difficulty judging texts they wrote themselves than those written by other writers. 
[...] to identify the strengths and weaknesses of someone else's text 
than one's own. For such reasons, researchers have been particularly concerned with 
[...] help writers judge text from the reader's point of view. 

Taken together, work in these areas is changing our thinking about the problem of 
assessing text quality and is laying the foundation for a theory of the process of evaluation 
[...]. Such efforts are helping us make more 
informed decisions about what makes a text-evaluation approach useful. Moreover, we are 
beginning to identify methods that have the advantage of enhancing both the writer's process 
of evaluating text and the reader's process of comprehending and using text. 

READING TO COMPREHEND VERSUS READING TO EVALUATE TEXT QUALITY 

To understand what an optimal text-evaluation method might look like, writing 
researchers have been examining the process of evaluation itself, that is, the 
cognitive processes of evaluating text with the goal of revising it for comprehensibility 
and/or usability. What is it that expert writers do when making revision decisions that 
improve the text from the reader's perspective? Do people "read differently" when engaged 
in revision? In a recent study of revision, Hayes, Flower, Schriver, Stratman, and Carey 
[14] asked the question: How is "reading for comprehension" different from "reading to 
evaluate?" Figures 1 and 2 present hypotheses about what some of the differences may 
look like. Figure 1 shows the cognitive processes in reading to comprehend text; it is a 
slightly revised version of the Hayes et al. model which was adapted from the Thibadeau, 
Just, and Carpenter "reader model" [26]. 

The intention of this model was not to enter the debate about whether reading is a 
bottom up or top down process, but rather to show that when one reads to comprehend, 
one's primary aim is to construct an integrated representation of the text. Put differently, 
during reading for understanding, most of our effort is devoted to "putting the text 
together" to construct an understanding of how ideas work as a whole. 



[Figure 1 diagram not reproduced. Title: Cognitive Processes in Reading To Comprehend Text. Top-level goal: READ TO COMPREHEND.] 

Figure 1. The Process of Reading for Comprehension (adapted from the Thibadeau, 
Just, and Carpenter model of reading [26] by Hayes, Flower, Schriver, Stratman, and 
Carey [14]). 

Notice that during the process of comprehending, the reader also sometimes detects 
text problems without much thinking or conscious attention devoted to them. For example, 
it is common to notice spelling or grammar faults in what we read. When we encounter 
such faults during reading to understand, we typically ignore them. We pay more attention 
to them, of course, if the faults are bad enough to slow our reading or to make us reread. 
During reading to comprehend, we might also note errors or ambiguities in the text's 
information. For example, if we are familiar with the topic, we often have a good deal to 
say about the author's claims, logic, examples, anecdotes, and even choice of language. 
We can think of our active engagement with the author as conversation, sometimes playful 
while other times aggressive. On the other hand, when we have little or no background 
information on the topic, we are more likely to spend our attention trying to understand and 
connect what we have read with our prior knowledge rather than scrutinizing the author's 
claims. 

[Figure 2 diagram not reproduced. Title: Cognitive Processes in Reading to Evaluate Text.] 

Figure 2. The Process of Reading to Evaluate Text Quality by Hayes, Flower, Schriver, 
Stratman, and Carey [14]. 

Although the activity of reading to comprehend is a very complex process indeed, 
writers faced with the task of revising a poorly constructed text must go well beyond 
comprehending the author's ideas. Instead, when "reading to evaluate text" (Figure 2), our 
goal is to identify weaknesses in the text as well as to find solutions for them. Reading to 
evaluate text can be viewed as a cognitive process which is "built on top" of the 
comprehension process, but with the added top-level goals of comprehending and 
criticizing the text from the point of view of its effectiveness for the intended audience. 
Thus, when engaged in "reading to evaluate," the writer consciously looks for problematic 
text features and attempts to discover alternative solutions. Furthermore, instead of simply 
trying to understand the text as best one can, the revisor must ask, "Is this the most 
rhetorically effective way to present these ideas to the intended audience?" 

One of the key differences between the models shown in Figures 1 and 2 is that in 
reading to evaluate, the writer's problem detections (some examples are shown on the right 
side of the model) become a source for possible discoveries (some examples are shown on 
the left side of the model), that is, alternatives for improving the text. For example, when 
writers recognize that the audience may not have the appropriate background knowledge to 
follow the text's major claims, they often create new examples and add supporting evidence 
to make the text more understandable. Choosing among revision strategies once a problem 
has been noted is often difficult because changing one aspect of the text changes others. It 
is usually hard to decide whether to keep the text basically as it is written and simply 
change the surface structure (that is, make changes to the phrasing) or to delete sections of the 
text as written and make wholesale meaning changes. 



COGNITIVE PROCESSES IN REVISING 

Figure 3 presents a modified version of the revising process developed by Hayes, 
Flower, Schriver, Stratman, and Carey a few years ago [14]. The model, derived from 
observing experienced and inexperienced writers at work, is intended to capture the 
thinking processes of writers engaged in text revision. 

As shown, text revision calls on a range of hierarchically organized subprocesses: 

• Representing the task: characterizing the text's goals, the goals for the intended 
audience, the writer's goals, the goals of others with influence over the text 
(editors, bosses, clients), the purpose for writing, the context (social, 
organizational, historical, cultural) in which the text is being revised, the 
constraints under which the revision is taking place, and the criteria being invoked 
for judging success. 

• Detecting: seeing or noticing problems. 

• Diagnosing: characterizing or describing the text's problems. 

• Selecting strategies: choosing among optional methods for solving identified 
problems (rewriting or editing). 

• Fixing problems: taking action to solve the problems. 

[Figure 3 diagram not reproduced. Title: Cognitive Processes in Revising Text. Boxes, from top to bottom: Represent Task, Detect Problems, Diagnose Problems, Select Strategies, Fix Problems.] 

Figure 3. The Process of Revision (adapted from the Hayes, Flower, Schriver, 
Stratman, and Carey model of revising [14]). 

The research from which this model was developed revealed dramatic differences in the 
abilities of experienced and inexperienced writers to engage in and carry out these 
processes. Within each of these subprocesses, writers have a variety of options. The 
ability to recognize available options and to make changes that actually improve text was 
found to distinguish experienced from inexperienced writers. 

Research on revision has been remarkably consistent in isolating two major 
differences between experienced and inexperienced revisors: 

• Experienced writers are skilled in evaluating global aspects of text quality such as 
rhetorical stance, organization, logic, cohesion, persona, and tone. Inexperienced 
writers are not. Inexperienced writers tend to focus on local-level errors such as 
word choice, grammar, and syntax. 

• Experienced writers are skilled in taking action to meet the needs of the audience, 
that is, making revision moves that improve the text from the reader's perspective. 
Inexperienced writers often identify the same problems as experienced writers but 
they are frequently unsuccessful in taking action to solve them. In fact, in some 
cases inexperienced writers' revisions introduce new problems and make the text 
worse instead of better [27]. 

From the research in writing, we can conclude that in choosing among methods to evaluate 
text, we need to draw on those that can help us act more like experienced writers. An 
optimal text-evaluation method should provide writers with two sorts of information: (1) 
information about whole-text or global aspects of text quality, and (2) information about 
how the audience may respond to the text. 



THE CONTINUUM OF TEXT-EVALUATION METHODS 

When we examine the kinds of document-evaluation methods currently in practice, 
we find a great deal of diversity both in the level of text problems they help writers to see 
and in the amount of actual reader feedback they provide. Figure 4 presents a continuum of 
text-evaluation methods. It classifies some of the most popular evaluation methods used in 
education, business, the health professions, publishing houses, and government, 
organizations which produce everything from textbooks to computer manuals to pamphlets 
on life-threatening diseases to mystery stories to tax forms. 

The continuum is divided into three sections (text-focused, expert-judgment-focused, 
and reader-focused methods), which are separated by how explicit the feedback 
from the intended audience is. My assumption here is that text-focused methods, while 
sometimes created from information about readers, never use direct reader response; that 
experts, through their experience, provide surrogate-reader feedback; and that 
reader-focused methods make explicit use of audience response. I have listed a variety of kinds of 
tests and/or the people who have developed or elaborated them (the list is not exhaustive). 
Under each test (or group of tests) are the typical concerns of evaluators using the method. 
If a group of tests tends to address similar issues, I list the concerns only once. Some of the 
concerns are ideas that evaluators keep in mind as they judge text quality, for instance, 
principles of style for visual or verbal text; in other cases, the concerns are variables for 
evaluation, perhaps the number and kind of errors a text leads a reader to make. Notice 
also that the tests within each class vary in the scope of text problems they help writers to 
identify, ranging from word-level to whole-text-level problems. 

Text-Focused Evaluation 



On the left side are text-focused methods, those which operate by asking a person 
(or sometimes a computer) to examine a text, attend to a set of text features, and assess text 
quality by applying principles or guidelines that have been developed from ideas (and 
sometimes from research) about how readers at a certain level and background will 
probably respond. Thus, the reader's input, when used to develop such tests, is indirect at 
best. Text-focused methods include readability formulas, computer-based stylistic analysis 
programs, guidelines and maxims, and checklists. 



[Figure 4 not reproduced in full. Only part of the reader-focused end of the continuum survives in this copy: 
  Concerns listed for a preceding method whose name is lost: problem-solving strategies; comprehension; misuse and error recovery; access and retrieval behaviors; inferences and predictions; satisfaction/preference. 
  Retrospective Testing 
    Comprehension (true/false, etc.): paraphrase; recall/summary/gist; recognition; inference. 
    Surveys, Interviews, and Focus Groups: rank/rate visual and verbal text; comprehension; persuasiveness and believability; satisfaction/preference; attitudes and beliefs; inference. 
    Critical Incidents/Storytelling: key events and incidents; relevance/severity judgments. 
    Reader Feedback Cards: comprehension; satisfaction/preference; attitudes and beliefs. 
  Key: a check mark denotes an example of a particular method or individuals who developed or elaborated a method; a bullet denotes a typical focus or dependent measure during evaluation.] 

Figure 4. Evaluating Text Quality: The Continuum from Text-Focused to Reader-Focused Methods. 

Readability Formulas 

Readability formulas such as the [...] [29], SMOG [30], Dale and 
Chall [31], Fry [32], or Kincaid [33] formulas operate by analyzing word frequency and 
sentence length. Such procedures have been discussed and severely critiqued at length by 
many researchers [34-38] and it is not my purpose to belabor their obvious deficiencies 
again. Research about how people use readability formulas has shown that they are often 
misused and misunderstood. Rather than using them as a gross index of the readability of 
a final draft, evaluators tend to use the formulas for specifying how writers must plan, 
write, and revise. Thus, "meeting the readability level" becomes the primary criterion for 
judging text quality. Unfortunately, there is no evidence to support this practice; in fact, 
just the opposite is true. To understand how loose the relation between comprehension and 
readability formulas is, one need only notice that a passage will get the same readability 
score whether its words are arranged in normal or backward order. 
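
The word-order point can be checked with a small sketch. It assumes the widely published Flesch Reading Ease formula (206.835 - 1.015 x words per sentence - 84.6 x syllables per word). The vowel-group syllable counter, the function names, and the sample passage are illustrative assumptions of this sketch, not procedures taken from the studies cited here.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate: count runs of vowels (an assumption of this sketch)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease computed from sentence, word, and syllable counts only."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def reverse_words_within_sentences(text: str) -> str:
    """Put the words of each sentence in backward order, keeping sentence boundaries."""
    reversed_sentences = []
    for sentence in re.split(r"[.!?]+", text):
        if sentence.strip():
            reversed_sentences.append(" ".join(reversed(sentence.split())) + ".")
    return " ".join(reversed_sentences)


# Hypothetical sample passage.
passage = "The writer revised the manual. Readers found the new version easier to follow."
scrambled = reverse_words_within_sentences(passage)

score_normal = flesch_reading_ease(passage)
score_backward = flesch_reading_ease(scrambled)
print(scrambled)
print(score_normal, score_backward)
```

Because the formula sees only counts of sentences, words, and syllables, the two scores coincide even though the scrambled passage is nearly unreadable.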

Indeed, research shows that writing to a readability level is an extremely questionable 
means for improving the comprehensibility of text. In discussing the use of readability 
formulas in the assessment of textbook difficulty, Singer and Donlan assert that sentence 
complexity and word frequency are only partial indicators of text difficulty because 

... a text may be relatively difficult because it has a high density of ideas 
and a high degree of interrelatedness or coherence among ideas. But, 
whether these characteristics of a text are difficult or not also depends upon 
the reader's prior knowledge, vocabulary ability, reasoning processes, 
purposes, and goals in reading the text. For example, if a text is densely 
packed with ideas but the reader's purpose is only to get the general idea of 
the text, the reader is likely to find the text easier than if his or her purpose 
was to comprehend the text fully. Hence ... the difficulty level of a text as 
computed by the Fry and Flesch formulas ... is only the average or general 
level of difficulty of a text. 

To determine the difficulty of a text for a particular reader, for example, a 
student, who was having difficulty in reading and learning from a text, we 
would examine factors not only within that text but also within the reader. 
In short, reading difficulty for a particular individual depends upon an 
interaction between the text and the individual [39, 330]. 

But because they are relatively easy to automate and cheap to employ, many organizations 
use readability formulas exclusively, despite the lack of empirical support for their validity 
in assuring text quality. In discussing methods that are likely to be important in the future 
of prose processing research, Voss, Tyler, and Bisanz [40] dismiss the future impact of 
readability research, devoting less than a paragraph to the topic. 

Computer-based Stylistic Analysis Programs 

Computer-based style programs (for example [41-43]), such as UNIX's Writer's 
Workbench [44, 45] or the GM Star program [46], typically operate in three steps. They 
assess readability using one or more of the standard formulas; they count features such as 
passive constructions, misspellings, and the numbers of simple, compound, or complex 
sentences; and they provide the evaluator with a statistical summary of the text's problems, 
scoring each feature (e.g., the number of passive sentences) by comparing its use 
against the proportion used in a "good text" template. As Figure 4 
shows, the focus of critiquers has been proofreading at the word or sentence level. 
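
To make the counting-and-comparing idea concrete, here is a minimal sketch of one such check. It is not the algorithm of Writer's Workbench or any other program named above. The passive-voice pattern is a deliberately crude assumption (a form of "to be" followed by a word ending in -ed or -en), and the template proportion of 0.2 is a hypothetical value chosen only for illustration.

```python
import re

# Assumption: a rough surface pattern for passives, not a real parser.
BE_FORMS = r"(?:am|is|are|was|were|be|been|being)"
PASSIVE_PATTERN = re.compile(
    rf"\b{BE_FORMS}\s+(?:\w+ly\s+)?\w+(?:ed|en)\b", re.IGNORECASE
)


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def passive_summary(text, template_proportion=0.2):
    """Count passive-looking sentences and compare with a hypothetical template value."""
    sentences = split_sentences(text)
    passive = [s for s in sentences if PASSIVE_PATTERN.search(s)]
    proportion = len(passive) / len(sentences)
    return {
        "sentences": len(sentences),
        "passive_sentences": len(passive),
        "passive_proportion": proportion,
        "exceeds_template": proportion > template_proportion,
    }


# Hypothetical draft text.
draft = ("The form was mailed on Monday. We reviewed it on Tuesday. "
         "The errors were corrected by the editor. Nobody complained.")
report = passive_summary(draft)
print(report)
```

The sketch flags a draft purely on the proportion of a surface feature; it has no access to the rhetorical situation in which a passive might be the better choice.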

For some time, companies have been trying to improve on the range of problems 
computer-based style programs check. Lance Miller, in describing the "space of possible 
critiques," describes a number of key distinctions that are important in evaluating the 
goodness of a style program: 



(1) the examination text-unit, (2) the report text-unit, (3) the critique type, 
[and] (4) the strength of the critique report. . . . The examination text-unit 
refers to the unit of text which is examined for the presence of some target. 
If the critique is that of spelling-checking, then the examination text-unit is a 
word. . . . 

The report text-unit is the unit at which the critique is made, and this unit is 
either the same as the examination unit or else larger. An example of the 
latter instance is when a text is critiqued for low frequency words 
(examination-unit = word) and the results are summarized on a paragraph 
basis (report-unit = paragraph), e.g., "This paragraph contains the 
following low frequency words." ... 

The third distinction, critique type, refers to the manner in which the critique 
is made, and the two options are isolated vs. relative. In an isolated 
critique, a particular examination-unit is compared against a standard, and 
the judgment can be rendered without taking into account the characteristics 
of that unit relative to other units. Thus checking for spelling errors, 
incorrect capitalization, overly-long sentences . . . involve an isolated 
critique. In contrast, a relative critique checks the characteristics of one text- 
unit (having certain features) against the characteristics of another text-unit 
(having different features); the logic of the comparison is along the lines of 
"if the first unit has an aspect of X, then the second unit must have an aspect 
of Y." Most ungrammaticalities, such as disagreement in number between 
subject and verb, involve a relative type of critique. 

The fourth distinction concerns critique strength for which there are also 
two possibilities: right-wrong vs. threshold. A right-wrong judgment is 
one in which one can say "Right!" or "Wrong!" without fear of 
contradiction (from experts), as is the case of the majority of grammatical 
errors. ... On the other hand, questions of style are not only matters of 
taste but . . . need to be reported with some deference and sensitivity to the 
fact that the author and critiquer may not share the same standards. One 
means of systematically handling the problem of varying stylistic standards 
is to arrange to have each stylistic evaluation result in the computation of a 
single number whose value grows with the severity of that particular gaffe; 
this value can then be compared against the threshold for a particular 
enterprise, and, if it exceeds that threshold, a suitable commentary is 
provided [47, 195-196]. 
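
Miller's distinctions can be illustrated with a short sketch. It uses his own example of an isolated critique (low-frequency words examined at the word level and reported at the paragraph level) and then applies a threshold rather than a right-wrong judgment. The `Critique` record, the function names, the common-word list, and the use of the count of rare words as the severity number are all hypothetical choices made for this illustration; Miller does not specify them.

```python
import re
from dataclasses import dataclass


@dataclass
class Critique:
    report_unit: str   # unit at which the comment is reported
    message: str
    severity: float    # single number that grows with the severity of the problem


def critique_rare_words(paragraphs, common_words):
    """Isolated critique. Examination unit: word. Report unit: paragraph."""
    critiques = []
    for number, paragraph in enumerate(paragraphs, start=1):
        words = re.findall(r"[a-z']+", paragraph.lower())
        rare = sorted({w for w in words if w not in common_words})
        if rare:
            critiques.append(Critique(
                report_unit=f"paragraph {number}",
                message="low-frequency words: " + ", ".join(rare),
                severity=float(len(rare)),  # assumption: severity = count of rare words
            ))
    return critiques


def apply_threshold(critiques, threshold):
    """Threshold critique: comment only when severity exceeds the enterprise's threshold."""
    return [c for c in critiques if c.severity > threshold]


# Hypothetical word list and text.
common_words = {"the", "a", "we", "will", "use", "plan", "to", "of",
                "this", "is", "clear"}
paragraphs = ["We will utilize the aforementioned plan.", "This plan is clear."]

critiques = critique_rare_words(paragraphs, common_words)
strict = apply_threshold(critiques, threshold=1.0)
lenient = apply_threshold(critiques, threshold=3.0)
print(strict)
print(lenient)
```

The same paragraph draws a comment under the stricter threshold and none under the more lenient one, which is the sense in which the author and critiquer "may not share the same standards."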

It is not surprising that most early style programs looked at the word and sentence 
level, summarized at the sentence and paragraph level, and focused mainly on isolated 
critiques and on right-wrong judgments. Miller argues that the primary challenge for 
developers of computer-based style programs is to go beyond the basics and to increase the 
space of critiques provided. Similarly, Richardson, Creed, and Chandler point out that 
most stylistic programs cannot address the kinds of grammatical problems that poor writers 
often create; the fundamental drawback of most programs is that "they rely too much on 
lookup tables instead of a parser to determine the roles words play in a sentence" [48, 57]. 

One program that aims to go well beyond the basics is IBM's Epistle system, now 
called Critique. It was developed by linguists and artificial intelligence experts at IBM's 
Watson Research Laboratory [49-51]. IBM released Critique recently (June 1989). 
Reporters from Language Technology Electric Word, a machine translation magazine from 
the Netherlands, who put the prototype through its paces in July 1988, described 
its features in this way: 

Identification of unrecognized words or awkward phrases, checking for 
spelling errors, grammar and style errors and the generation of statistical 
information. It appeared to be fast and reliable. 

The program is written with Penelope, Heidorn's Programming Language 
for Natural Language Processing, and is based on colleague Karen Jensen's 
PEG (PLNLP English Grammar.) It parses a sentence, provides a 
syntactical representation, then employs hundreds of grammar rules to 
check the sentence's grammatical structure, before it highlight [sic] 
problems on the screen. Users will be able to establish individual profiles 
so that Critique will also reflect personally selected criteria [52, 7]. 

Currently, Critique runs as a new feature of IBM's mainframe editing software Process 
Master 1.3 (running on a VM/CMS operating system). Reporters speculate that there may 
be a PC version under development. For information on how Critique is being used in 
writing classes, see Richardson, Creed, and Chandler's summary of a pilot program at the 
University of Hawaii at Manoa. They point out three virtues of the program: 

• Writers can use it interactively. 

• It has three levels of help screens that provide information about principles of 
grammar and usage. 

• It provides parse trees for each sentence it processes, thus allowing writers to see 
the structure of their sentences [48, 58]. 

Two other style checkers are worth noting (they won the 1989 State-of-the-Art Electric 
Word Awards for Technical Excellence): Grammatik III for the PC and MacProof for the 
Macintosh: 

Grammatik III made by Reference Software Inc. proofreads documents for 
errors in grammar, style usage, punctuation, and spelling. Grammatical 
errors identified include improper use of homonyms (its/it's, 
they're/there/their) and possessives (you/you're, who's/whose) 
transpositions (form/from), disagreement between subject and verb (the 
government think) redundant comparatives (more better), incomplete 
sentences, double negatives, and split infinitives . . . also checks jargon, 
sexist terms, redundant phrases, neologisms, and overused phrases . . . 
also flexible enough to allow you to turn off rules and even add new ones of 
your own . . . and the documentation is so well written that even the 
layperson can make such modifications. 

MacProof checks on what its makers, Lexpertise Linguistic Software, call 
mechanics, usage, style, and structure . . . "mechanics" refers to spelling, 
punctuation, capitalization, and double words; . . . dictionary contains 
120,000 entries. The "usage" dictionary contains 10,000 terms to be 
flagged for such barbarisms as offensiveness, imprecision and verbosity. 
"Style" means little more than flagging the verb "to be" . . . and "structure" 
is essentially about counting words in sentences and lines in paragraphs . . . 
it checks for logical transitions between paragraphs . . . [53, 35]. 



Guidelines and Maxims 



Guidelines and maxims are perhaps the most popular text-focused method used. 
They are usually aimed at giving writers advice on the linguistic, stylistic, or graphic 
features of text (for example [54-57]). From a writer's perspective, most guidelines are 
frustrating to use either because they are vague and generic, e.g., "omit needless words" 
(Strunk and White [58]) or because they force us to assume that all writing tasks are alike 
and require the same simplistic prescriptions (e.g., "use short sentences"). Put differently, 
guidelines often fail to help writers adapt their texts to the unique features of the given 
rhetorical situation. 

Furthermore, evidence suggests that writers have difficulty recognizing when and 
how to apply guidelines [23, 59-61]. When guidelines are invoked too rigidly, they 
function as rules and can have the effect of stifling creative solutions to rhetorical problems. 
Although there are genuine difficulties associated with the guideline approach to judging 
text quality, there have been some very good examples of the effective use of guidelines, 
such as Williams's well-known text on style [57]. 

Checklists 

Checklists, another text-focused method, typically work in one of two ways. On the 
one hand, the evaluator is asked to use the checklist as a reminder of issues to consider. 
For good examples of checklists, see Price's "giant checklist" for writing computer 
documentation [62] or Spencer's "usability considerations checklist" for testing computing 
systems [63]. Many checklists focus on recommending visual or verbal text features to 
employ or those to avoid or use sparingly. On the other hand, some checklists are essentially 
additive weighting procedures which ask the evaluator to rate the text's features along a 
"goodness" scale and then to assign a quality score to the text. (See Hayes [64] for a 
discussion of how to design an additive weighting scale.) 
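
An additive weighting checklist of the kind just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature names, the weights, and the 1-to-5 "goodness" scale are hypothetical and are not taken from Hayes [64] or from any published checklist.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    feature: str   # text feature being judged (hypothetical examples below)
    weight: float  # relative importance chosen by the checklist designer
    rating: int    # evaluator's rating on an assumed 1-5 "goodness" scale


def quality_score(items):
    """Return the weighted mean rating, on the same 1-5 scale as the ratings."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(item.weight * item.rating for item in items) / total_weight


# Hypothetical checklist for one text.
checklist = [
    ChecklistItem("headings describe content", 3.0, 4),
    ChecklistItem("sentences are concise", 2.0, 2),
    ChecklistItem("passive voice is avoided", 1.0, 5),
]
score = quality_score(checklist)  # (3*4 + 2*2 + 1*5) / 6 = 3.5
```

The single number that comes out hides both choices that went in: a different set of weights, or a rhetorical situation in which the passive voice is appropriate, would change the score without the text changing at all.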

A drawback of checklists lies in the difficulty of deciding what text features are most 
important and in assigning weights or numerical values to text features. Writers usually 
disagree about the values assigned to any given feature. And checklists, like guidelines, 
usually fail to ask evaluators to judge the use of text features in relation to the given 
rhetorical context. For example, there are many rhetorical situations in which the passive 
voice is the most sensitive linguistic choice, yet most checklists remind writers to avoid 
using passives. Such situations leave the writer with the questions: How "bad" is a text 
feature that is rated average or below average? If two texts receive the same low score but 
are intended to serve different rhetorical purposes, are they equally poor? How should text 
feature ratings be used in revision? Should all poorly evaluated text features be revised 
extensively? 

It should also be pointed out that most checklists are not based on data from readers 
or users of the text under evaluation. Rather, they are often created by consolidating an 
organization's conventions and accumulated folklore about the features of good and bad 
text. Checklists may simply codify an organization's misunderstanding of the audience. 

Summary 

Advantages of text-focused methods are that they are inexpensive to use, some 
can be automated, and they can be helpful in detecting certain obvious classes of error. The 
inherent weakness of these methods lies in their predominant focus on word- and 
sentence-level features of the text. Typically, their output provides little, if any, information about 
how the document is working at the paragraph and whole-text level. Perhaps the biggest 
weakness is that their output provides no information about the reader's needs. When 
text-focused methods are used as the only guide for revision, research by Swaney, Janik, 
Bond, and Hayes [22] shows that revisors may actually make the text worse instead of 
better. 

Expert-Judgment-Focused Evaluation 

Expert-judgment-focused methods constitute another widely used set of evaluation 
procedures. By expert judgment, I mean the judgment of individuals who possess extensive 
knowledge about the text, its audience, or writing itself. Expert-judgment-focused methods 
include peer reviews, technical and/or subject-matter expert reviews, editorial reviews, and 
external reviews. 

Peer Review 

Peer review is one of the more standard expert judgment methods employed by 
education, industry, and government [65-68]. With peer review, people who share a 
common background are called upon to evaluate texts for issues of style, consistency, tone, 
and the like. Peer reviews can be very informative in pointing out text problems, allowing 
the writer to draw on the multiple perspectives of other writers. Peer reviewers tend to be 
quite good at recognizing stylistic issues at both the local and global levels, and writers find 
that peers are helpful in making suggestions to solve organization problems. 

However, some writers report that peer review can also be a frustrating experience. 
When the writer receives divergent opinions about the problems the text will create for 
readers (or when personalities enter into decisions about what is problematic) it is often 
difficult to determine which problems to solve and which suggestions for revision to use. 
This difficulty is magnified when the revisor is operating under severe time constraints. 

Peer reviews can also suffer from evaluators who work too frequently with texts of 
similar genres and subject matter. Writers who always evaluate the same sort of text (for 
instance, proposals) may not improve in their skills over time, but may actually erode their 
skills by doing too much of the same kind of text evaluation all the time. When evaluators 
always work with the same kinds of texts, they can become insensitive to the audience's 
likely response to texts of that sort. Researchers who studied experienced U.S. 
government writers at the IRS, for example, found that evaluators were particularly 
insensitive to language and stylistic issues that bothered readers outside that institution 
[69]. Indeed, peer review is a way of socially constructing and institutionalizing certain 
styles. 

Peer review has also come under question by authors who submit articles to 
professional journals that use peer review for judging manuscripts for publication [70, 
71]. Authors whose work is evaluated by peer reviewers sometimes question the criteria 
used for making decisions about what gets published and what does not. They suspect that 
it is almost impossible to conduct a truly "blind" review since often the peer can guess the 
author's identity by carefully examining the reference list [72, 73]. Because peer 
reviewers for journals serve such a critical gatekeeping function, authors are concerned that 
peer reviewers invoke consistent standards for all manuscripts received. 

Technical and/or Subject-Matter Expert Review 

Technical and/or subject-matter expert (SME) reviewers usually conduct content 
evaluations of text, aiming to find deficiencies in coverage, accuracy, authenticity, or 
completeness. In many industrial contexts, for example, technical reviews are conducted 
by engineers or computer scientists who assess a text's content in terms of its match with 
the functionality of a product or a machine. Technical reviews are intended to provide 
writers with detailed information about the ways in which text content is inaccurate or 
misleading. While a technical review can be conducted by a technically-oriented person 
like a computer programmer who is verifying the procedures presented in a user's manual, 
this is not always the case. The phrase technical review is also used to refer to evaluations 
by subject-matter experts who verify text adequacy, like a museum historian who is 
verifying the accuracy of facts presented in a brochure. Those who participate in 
subject-matter expert reviews are typically extremely knowledgeable about the content, the 
information medium, the audience, or the rhetorical situation in which the text will be read 
or used. 

Subject-matter expert reviews conducted by marketing experts, for example, may 
include a presentation and delivery critique, checking for features such as the tone and 
mood created by the integration of the visual and verbal text. Thus, they may evaluate the 
presentation and the delivery of the content in terms of its match to a set of articulated goals 
(for example, the text must be short; it should present a theme; it should use vibrant color 
and visuals) or against a set of esthetic criteria (for instance, the text should convey 
seriousness and warmth). 

Although technical and/or subject-matter expert reviews do give valuable 
feedback about difficulties with a text, it may be unwise to use such reviews in isolation. 
Research is beginning to show that topic knowledge is sometimes a detriment instead of a 
help and that experts are not always the best people to ask about text quality. Hayes, 
Schriver, Blaustein, and Spilka [74] found what they term "the knowledge effect in 
writing": readers with high topic knowledge were very poor in judging how lay readers 
would understand the topic. 

Similarly, in another study, I found that writers with 2 to 3 years of experience with 
word processing were extremely insensitive in judging the kinds of problems new users 
would have with poorly written procedural instructions for a word processor [15]. To 
help writers recognize and overcome their insensitivity, I asked them to study the 
transcripts of think-aloud protocols from a group of new users which demonstrated 
numerous comprehension and usability problems. After reading users' comments 
illustrating their unsuccessful attempts to invoke simple commands, some writers reported 
that the users' errors seemed stupid and that it was hard to remember what it was like to be 
a newcomer to computers. Such research reminds us that writers, technical experts, or 
subject-matter experts with high topic knowledge may find it especially difficult to 
anticipate the needs of readers with low topic knowledge. 

Editorial Review 

Editorial in-house reviews, another expert judgment evaluation procedure, are 
typically carried out by senior writers or copy editors who check for such issues as style, 
[...] specifications, or use of conventions. Traditionally, editorial reviews focused 
on grammar and mechanics. Bourns and Grove point out that in many settings, editorial 
reviews used to be quite mechanical and tended to be extremely rule-oriented [75]. More 
recently, [...] editorial reviews [...] expanded to issues of organization, 
presentation, readability, coherence, retrievability, and accuracy. Put differently, editors 
have moved away from a one-dimensional view of what they do and now see their work as 
a complex hierarchy of skills and perceptual abilities [76-79]. 

Another way that editorial reviews are changing lies in the kinds of advice they 
provide. In the past, most editorial reviews were viewed as activities designed to find 
errors in text. Today, most editors consider their role much broader than the wordsmith 
who looks for problems. Instead, they view their role as discovering ways to improve text 
(see Henke [80] for a brief discussion of the usefulness of tabulating editorial 
contributions rather than number of errors found). In effect, the definition of an editorial 
review is slowly changing from editing to revising. 

A similar evolution in thinking has occurred in the research on composing. Although 
early research in composing focused on studying editing and mechanical correctness, 
today's work looks at the process of whole-text revision. Studies show that expert writers 
are much more than standard good editors; they are able to "resee" text in ways that 
standard good editors cannot [14, 81-84]. Put differently, expert writers are revisers, 
not editors. 

Although we have seen dramatic practical improvements in the editorial review 
process, we have seen almost no research in the area. Longitudinal studies need to be done 
which track the editorial review process over many writing tasks and which focus on 
particular writers working alone and collaboratively. Such work might find that some 
skills get much better with time while others get worse. As mentioned above, research 
investigating the "knowledge effect in writing" [74] provides us with reason to suspect 
that some editors may have an "in-house effect": they have been editing within the same 
context on similar text types too long. Alternatively, we may find what we already believe: 
Experienced editors, unlike many writers, are much more skilled in recognizing the 
audience's needs and in making effective linguistic and rhetorical choices that meet those 
needs. 

External Review 

In many contexts, it is impractical and even undesirable to judge text quality using 
people who are insiders to the context, like peers or technical and/or subject-matter experts. 
In such cases, external reviews are used for judging text quality. Organizations often turn 
to external reviews when they recognize that something is wrong with the texts they 
produce but are uncertain how to pinpoint the problems and need to gain a fresh perspective 
on the quality of their document design. Thus, many document design and graphic design 
consulting agencies are retained by organizations who want critical feedback about how 
their texts are functioning from a competitive standpoint. External reviews vary in the 
methods employed to conduct them and the people who carry them out. 

One type of external review, a text features evaluation, criticizes the relative goodness 
of a text by assessing the design of visual or verbal features. Text features evaluations 
typically involve selecting a representative set of an organization's texts and then analyzing 
them in terms of key features, such as style, tone, content, format, grid systems, logos, 
and so on. In this way, text features evaluations aim to characterize how the integration of 
the visual and verbal text shapes the organization's public image. From such a diagnosis, a 
new plan can be derived that better matches the organization's goals. 

Another kind of external review uses holistic rating methods to judge text quality 
[85-89]. According to Charney, "holistic rating is a quick, impressionistic qualitative 
procedure for sorting or ranking samples of writing. It is not designed to correct or edit a 
piece, or to diagnose its weaknesses. Instead, it is a set of procedures for assigning a value 
to a writing sample according to previously established criteria" [85, 67]. Holistic rating 
refers to the set of methodologies used to arrive at a total impression of a text. Testing 
agencies such as the Educational Testing Service (ETS) use holistic scoring to judge 
[...] Scholastic Aptitude Tests (SAT) and the high school Advanced Placement [...]. 
[...] on how to derive a holistic rating; two 
of the more typical methods are general impression marking and primary trait scoring. 

General Impression Marking is a method in which a "rater fits a writing sample into 
an ordered ranking on the basis of the total impression created by the paper. The 
defining characteristic of this approach is that it weighs sample papers against each other 
rather than against a predetermined set of criteria" [85, 72]. The criteria are derived 
inductively by either test organizers or by the evaluators themselves. Often test organizers 
using general impression marking will select a set of "anchor texts" which represent the 
range of good to poor texts the judges can expect to see. Evaluators are then trained to 
judge a set of texts against the anchor papers. 

Primary Trait Scoring, developed by Lloyd-Jones [90], is different in that it gives 
[...] judging texts, it uses a set of explicit 
[...] using an agreed-upon [...]. 

[...] Charney cites a number of studies which show that 
[...] "easily influenced by salient, though superficial, 
characteristics of writing" (spelling, length, unusual words, and the quality of handwriting) 
[...]. Even when they agree on the predetermined criteria, they tend to 
[...] evaluation. For such reasons, Charney 
[...] holistic scoring [...]. 

Another type of external review is the consumer advocate review conducted by people 
[...]. [...] Office of Consumer Affairs has evaluators who judge the clarity of 
instructions, warranties, and contracts (see the Consumer Resource Handbook [94]). 
[...] legal, health, and safety implications of poorly [...]. 
[...] Baldrige, former U.S. Secretary of 
Commerce, and Lee L. Gray, former U.S. Director of Consumer Affairs, went to great 
lengths to stress that "[...] or writing in plain English is [...] private 
and public sectors" [95, preface]. Their important work, [...] 
Twelve Case Studies [95], provides 
concrete evidence of the enormous practical and financial benefits associated with 
[...] insurance policies [...]. 

[...] consumers [...] text quality than ever before. [...] 
1989, MACazine introduced a feature called "Reader Reports" in which [...] evaluate 
computer products along various dimensions, and one of the [...] features 
[...]. Surprisingly, in their [...] survey over 13[...] 
responded, highlighting that consumers of high technology want [...]. 

A gatekeeper review is one in which a text is evaluated by a group of individuals who 
[...]. According to the U.S. Department of Health and Human Services: 






Often, public and patient information education materials are distributed to 
their intended target audiences through health professionals or other 
intermediary organizations. These intermediaries act as gatekeepers, 
controlling the distribution channels for reaching target audiences. Their 
approval or disapproval of materials is a critical factor in a program's 
success. If they do not like a poster or a booklet, it may never reach the 
intended audience. . . . Questions may include such areas as overall 
reactions to the materials and assessments of the appropriateness, 
completeness, and utility of the information [97, 25]. 

Along with gathering information about whether a given final draft "will fly" in the 
particular context in which it is intended, gatekeeper reviews are sometimes used to help 
writers plan their texts. Floreak presents an interesting case study describing how 
extensive interviews with gatekeepers in a small town's community services organization 
provided valuable insight into the target audience for a poster campaign designed to help 
low literate parents care for their youngsters [98]. Gatekeeper reviews then can be helpful 
in both planning and revising text. 

Another type of external review is the document design process critique, an 
evaluation procedure that focuses on identifying predictors of poor writing quality [99]. It 
is designed to help identify weaknesses in the ways in which a writer, a group of writers, 
or an organization engages in the process of creating text. The idea is to try to predict (and 
prevent) poor writing before it occurs. Process critique evaluators examine the approach to 
planning, generating, revising, and evaluating text. They look at the way people 
collaborate, the guidelines writers follow, and the kinds of feedback that go into the shaping 
of a text. In effect, evaluators pay particular attention to the way typical writing tasks get 
done, assessing project management and observing the nature of communication channels (for 
example, between writers and technical experts) throughout a writing project. The goal is 
to identify the strengths and weaknesses in the process along with recommending education 
or research that will help remedy the weaknesses. 

Summary 

Although expert-judgment-focused evaluations are useful and can provide a 
wealth of information for the writer, they often suffer from the evaluators being too close to 
the text or the product the text describes. In many contexts, the only readers who participate in 
evaluating a text are the readers within an organization who know most about the text 
and/or the product it describes: peers, technical experts, and subject-matter experts. The 
result is that the text may work well for people such as engineers, computer scientists, and 
marketing specialists (people who developed or influenced the creation of the text) but 
may fail miserably for the average reader. Certainly external reviews are quite helpful in 
supplementing standard in-house evaluation procedures. But expert-judgment-focused 
evaluation methods should not be used in isolation; they need to be supplemented with 
other document evaluation procedures, particularly those which are reader-focused. 

Reader-Focused Evaluation 

Reader-focused text-evaluation methods, on the right end of the continuum, are 
procedures which rely on feedback from the intended audience. There are two general 
classes of reader feedback methods: concurrent tests (which evaluate the real-time 
problem-solving behaviors of readers as they are actively engaged in comprehending and 
using the text for its intended purpose) and retrospective tests (which elicit feedback after 
the reader has finished reading and using the text). Concurrent reader feedback 
methods include cloze testing, behavior protocols (sometimes called motor protocols), 
performance testing, and thinking-aloud verbal protocols. Retrospective tests include 
[...] surveys, interviews, focus groups, critical incidents, and reader 
feedback cards. 

Concurrent Testing 

The cloze test [100-102] presents readers with text which has had words 
systematically deleted, asking readers to try to fill in the missing words. The idea is that 
quality text should have a high degree of lexical predictability. Thus, if a text is "good," 
readers should be able to fill in the blanks. To use the cloze technique, evaluators: 

... simply delete or omit every fifth word from a passage of approximately 
250 words, but the sentence before and after the passage is left intact. A 
total of 50 words will be deleted from the passage. The reader's task is to 
infer from the remaining content what the missing words are, retrieve the 
exact words from vocabulary stored in his or her memory, and insert them 
into the passage. In scoring, only the exact, original word is counted as 
correct. The cloze technique places a premium upon the reader's ability to 
infer the missing words from the semantics and syntax of the remaining 
words in the passage and upon the reader's vocabulary repertoire and ability 
to retrieve words from storage in memory [39, 311]. 

The cloze test is interesting because it does take real readers into account and, 
surprisingly, the activity of filling in the blanks does appear to draw on many levels of the 
reading process: word recognition, knowledge of syntax and semantics. However, it 
may be limited in the genres to which it can be applied. It seems best suited for 
narrative and expository text and seems most unsuited for procedural or reference texts; 
for example, the cloze test would be a very bad test to evaluate the quality of a telephone 
book. It also fails to provide any feedback about how the text is working from a visual 
perspective. 
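
The cloze procedure quoted above (delete every fifth word, count only exact replacements as correct) can be sketched in Python. This is a minimal illustration, not a validated instrument. The function names are hypothetical, the sketch does not leave the sentences before and after the passage intact as the quoted procedure requires, and ignoring letter case and surrounding punctuation when scoring is an added assumption.

```python
import string


def make_cloze(passage, n=5, blank="_____"):
    """Delete every n-th word; return the gapped text and the answer key."""
    if n < 2:
        raise ValueError("n must be at least 2")
    gapped, key = [], []
    for position, word in enumerate(passage.split(), start=1):
        if position % n == 0:
            key.append(word)      # remember the deleted word
            gapped.append(blank)  # and leave a blank in its place
        else:
            gapped.append(word)
    return " ".join(gapped), key


def _normalize(word):
    # Assumption: case and surrounding punctuation are ignored in scoring.
    return word.strip(string.punctuation).casefold()


def score_cloze(key, responses):
    """Return the fraction of blanks filled with the exact original word."""
    if len(key) != len(responses):
        raise ValueError("one response is required per blank")
    if not key:
        return 0.0
    correct = sum(_normalize(k) == _normalize(r) for k, r in zip(key, responses))
    return correct / len(key)


passage = ("The quick brown fox jumps over the lazy dog "
           "near the quiet river bank today")
gapped, key = make_cloze(passage)
# A hypothetical reader supplies "by" where the original had "near".
score = score_cloze(key, ["jumps", "by", "Today"])
```

In this example the reader's "by" is a perfectly sensible word for the second blank but scores as wrong, which shows how strict exact-word scoring is.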

Another type of concurrent testing involves collecting behavior protocols, that is, 
recordings of readers' actions and behaviors. The primary feature of behavior protocols is 
that participants do not talk aloud while performing a task; they simply [...] 
while [...] a computer program records what they do. Evaluators 
collecting behavior protocols are often interested in such issues as the following: 

• How people comprehend information and solve problems with text that is 
presented in prose and/or with diagrams, illustrations, or pictures. 

• How quickly and accurately people can perform a task using only printed 
[...] to assemble a bicycle or to [...]. 

• How people find information in lengthy texts such as reference guides (in 
indexes, in tables of contents, in glossaries). 

• How [...] printed instructions (whether in hardcopy or 
[...]) [...] with how users recover from errors as 
[...] computer [...]. 

• How computer interface design features such as color, windowing, or display rate 
influence people's ability to use computers (evaluating the differences between a 
small CRT screen and a large bit-mapped display). 

Behavior protocols include keystroke logs, eye movement studies, and user-edits. 
Keystroke logs, which can be collected automatically during interaction with a computer, 
provide detailed information about users' error and error-recovery patterns and can be used 
to develop models of users' behavior [103, 104]. 

Eye movement protocols have been used to determine the effect of colors, display 
rate, and cursor movement in online documentation and interface design [105]. They have 
also been used to study how people read scientific texts involving prose and diagrams 
[106]. At this point, most of the work in this area is concerned with studying the behavior 
of the eyes during reading from a computer screen rather than using the method for text 
evaluation. Voss, Tyler, and Bisanz point out that: 

Although there are some problems with interpretation of what eye 
movements reflect (see McConkie, Hogaboam, Wolverton, and Lucas 
[107]), most research has validated the assumption that the position of the 
eye at any given time corresponds to what is currently being processed (Just 
and Carpenter [108]). The measures obtained from eye movement data can 
include the number of fixations within a given text portion, the number of 
saccades, the number of regressive eye movements, or simply the total gaze 
duration, independent of the number of fixations. Rayner [109] provides a 
good summary of these various approaches [40, 380]. 
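
The measures listed in the passage above can be computed from a simple fixation record. The Python sketch below is only illustrative: the record format (a word index plus a duration in milliseconds), the `summarize` function, and the sample data are all assumptions, not the output format of any real eye tracker or of the cited studies.

```python
from typing import NamedTuple


class Fixation(NamedTuple):
    word_index: int   # position in the text of the word being fixated
    duration_ms: int  # how long the eye rested there


def summarize(fixations, start, end):
    """Summarize a fixation record for the text portion start..end (inclusive).

    Fixation count and gaze duration are restricted to the text portion;
    saccades and regressions are counted over the whole record.
    """
    in_portion = [f for f in fixations if start <= f.word_index <= end]
    moves = list(zip(fixations, fixations[1:]))  # successive fixation pairs
    return {
        "fixations_in_portion": len(in_portion),
        "saccades": len(moves),
        # A regressive movement jumps back to an earlier word.
        "regressions": sum(b.word_index < a.word_index for a, b in moves),
        "gaze_duration_ms": sum(f.duration_ms for f in in_portion),
    }


# Hypothetical record: the reader jumps back once, from word 2 to word 1.
record = [
    Fixation(0, 200),
    Fixation(2, 250),
    Fixation(1, 180),
    Fixation(3, 220),
    Fixation(5, 300),
]
summary = summarize(record, start=0, end=3)
```

For this record the text portion covering words 0 to 3 receives four fixations totalling 850 ms, and the whole record contains four saccades, one of them regressive.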

Another type of behavior protocol, the user-edit, first described by Atlas [17], 
involves observing readers directly while they work and interact with a machine, using 
only its operations manual as a guide. The observer (who sits either near the user or in 
another room while observing through a two-way mirror) pays close attention to how 
readers use text, when they use text, and how the text helps or hurts understanding. User- 
edits are now widely used in industry to evaluate usability of text. 

Performance testing characterizes the class of tests in which evaluators monitor 
factors such as readers' task performance, retrieval and access behaviors, error recovery 
strategies, cognitive load, and general ability to use a text [24, 63, 110, 111]. Thus, 
user-edits are a type of performance test. Evaluators using performance testing are often 
concerned with obtaining benchmark information about speed and accuracy [112, 113]; 
thus, talking aloud is an undesirable activity because it adds to the time on task. However, 
since it is often hazardous to infer problem solving strategies without more explicit 
indicators of thinking such as those gained through verbal reports, many evaluators use 
performance testing to look at large numbers of participants and supplement their evaluation 
with case studies using think-aloud protocols. As Evans points out: 

Used as part of a wider research project, case studies can provide material to 
illustrate or test a theory, and they may . . . help to humanize, what, without 
such additions, might be an arid statement of observations or facts. 
Research which has been reduced to mere statistics can seem very remote 
from the flesh and blood world we know, and case studies, judiciously 
used, can reclothe the bare bones . . . [114, 11]. 

Clearly, performance testing has played and will continue to play a major role in text 
evaluation in the future. See Schumacher and Waller [115] for an excellent review of 
frequently used methods in document design. 

Thinking-aloud protocols ask participants to perform a task while thinking aloud as 
they interact with a document and/or with a machine [22, 116-123]. When people 
experience difficulty in comprehending or in using the document, their comments typically 
reveal the location and nature of the difficulty [20]. Unlike participants in behavior 
protocols, think-aloud participants are asked to verbalize anything that comes to their mind 
as they are engaged in the task. Because thinking-aloud protocols are collected while the 
person is reading and is engaged in the process of comprehension, they provide much more 
explicit and complete information than do readers' comments collected after reading is 
finished. The advantage of think-alouds is that participants often say how and why they 
are having a difficulty with the text. Therefore, the writer has both locative and diagnostic 
information that will help guide revision decisions. In addition, think-alouds often 
highlight both visual and verbal text problems caused by either what has been written or by 
what has been omitted, an important advantage over other document-evaluation 
procedures. Thus, think-alouds are typically used when the goal is to assess how people 
understand, solve problems with, draw inferences about, use, or read text [21, 119, 
124-127]. 

In the early 1980s, Hayes and his colleagues at Carnegie Mellon University's 
Communications Design Center pioneered a technique using thinking-aloud protocols 
called protocol-aided revision to revise texts such as insurance forms, apartment leases, 
computer manuals, and medical consent forms [22, 116, 118, 128]. Protocol-aided 
revision is a process in which evaluators videotape or audiotape readers as they think aloud 
while comprehending a text and/or while interacting with machines, toys, devices, 
equipment, and the like. The transcripts are then analyzed for evidence of readers' 
problem-solving strategies, comprehension, miscues and error recovery, access and 
retrieval behaviors, inferences and predictions, along with comments indicating satisfaction 
or preference. Such information is then used to guide revision activity. Protocol-aided 
revision is an iterative process involving testing a text with members of the intended 
audience, revising based on the problems readers experience, followed by more testing and 
revising until the text satisfies the reader's needs and the writer's goals. 

In 1986, Dieli compared think-aloud protocols with some other methods 
(guidelines, a computer-based style program called "Murky," and checklists called revision 
filters) to determine the kind of information provided by each [59]. Results showed that 
no single method was best but that guidelines were worst, reiterating that writers need to 
consider the costs and benefits associated with alternative evaluation methods. And 
Holland and her colleagues [119], who studied writers revising procedural instructions 
after watching videotapes of readers using their texts, found that writers who observed 
readers-in-action were much more able to solve text problems that were specific to the 
rhetorical situation, that is, problems for which guidelines were too general to be helpful. 

Although think-aloud protocols have obvious advantages over other methods, it is 
important to recognize their limitations as well. Glass, Holyoak, and Santa raise the 
following issues: 

• Often a protocol will seem to have "gaps" in which the participant forgets to speak. 

• Sometimes participants will take a "mental leap," reaching some conclusion without 
mentioning any intermediate steps. 

• Sometimes the protocol will be ambiguous and difficult to interpret. 

• They are time-consuming. 

• They are verbal and are difficult if not impossible to conduct with children. 

• If participants are using visual imagery or some other nonverbal representation, 
they may be unable to talk about what they are doing. 

• Participants may use a more systematic method for solving problems than they 
would normally because they know they are being watched [129, 416-417]. 

Despite these limitations, protocol analysis remains one of the most informative methods 
for studying problem-solving behavior. 

A few years ago, I observed that writers working at Carnegie Mellon's 
Communications Design Center who had extensive experience using protocol-aided 
revision seemed better able to anticipate a reader's interaction with their texts than were 
other professional writers with years of on-the-job professional writing and editing 
experience. When I questioned these writers about why they were so good, they claimed 
that protocols changed not only the way they revised text, but the way they planned. 
Indeed, these writers had collected and evaluated the transcripts of dozens of think-aloud 
protocols. Their claim both intrigued and puzzled me. I found that writers were unable to 
articulate in what way(s) protocols had changed their writing. 

I wondered if their superior skill in evaluating and revising text resulted from their 
frequent and direct experience with reader feedback. I thought that if this were true, a 
sequence of lessons that took writers through a similar experience might help them increase 
their sensitivity to readers' needs. To this end, I refined the protocol-aided revision 
methodology, characterized the cognitive processes involved in using the method UU, 
211, and developed and evaluated a protocol-aided revision pedagogy. The aim of the 
teaching method (described elsewhere in detail) was to give writers the benefits of 
protocols without the need to collect protocols on every text [15]. 

After training in the protocol-aided revision pedagogy, writers were tested on their 
ability to accurately predict readers' problems with texts for which protocols were 
unavailable. Five classes of writers taught with protocols were compared with five classes 
of writers taught using guidelines, audience analysis heuristics, and peer review 
procedures, that is, with more standard text-focused and expert-judgment-focused 
approaches. In particular, writers were compared for their ability to detect and diagnose 
readers' problems along three dimensions: 

• Commission versus omission, that is, problems caused by what the text says 
versus what it leaves out. 

• Problems characterized from the perspective of the reader, the self (i.e., the 
writer), or the text. 

• Problems at the global or local level of the text. 

Results show that writers taught to anticipate readers' problems with poorly written 
instructional text, using the protocol-aided revision pedagogy, improve significantly (p < 
.005) in their ability to judge readers' problems accurately. More specifically, writers 
taught with the protocol-aided revision method improve in their ability to predict problems 
of omission, problems from the readers' point of view, and global problems. For each of 
the three types of diagnostic categories, experimental writers improved more than did 
control writers (p < .005). Writers in the experimental group made dramatic gains in their 
ability to detect and diagnose problems caused by difficulties such as poor organization, 
ambiguous purpose statements, missing illustrations and diagrams, faulty analogies, and 
unclear procedures. 

In addition, writers who were taught to anticipate readers' problems by studying the 
protocol transcripts of lay readers comprehending instructional texts (in this case, computer 
manuals) were able to transfer their knowledge to anticipating lay readers' problems with 
elementary science texts. Thus, learning about how readers responded to one genre helped 
writers anticipate readers' problems with another. Such results also underscore the benefits 
of using protocol-aided revision not only for improving texts under evaluation, but for 
enhancing writers' skills generally. 

Retrospective Testing 

Retrospective methods are the more frequently used of the reader-focused methods. 
They include a wide range of comprehension tests, along with methods such as surveys, 
interviews, focus groups, critical incidents, and reader feedback cards. The problems 
associated with retrospective reports have been well documented by Ericsson and Simon 
[124]. Aside from the drawback of asking readers to reflect on their remembrance of 
comprehending the text, the primary disadvantage of retrospective tests is that they 
frequently fail to pinpoint specific text features that need revision and instead give 
the reviser vague and often uninterpretable feedback. For example, respondents on a 
reader-feedback card may write, "It was pretty easy to read except for some of the procedures." 

Comprehension testing has been a widely used retrospective measure in evaluating 
text quality. Basically, it involves asking readers to paraphrase, recall, summarize, 
recognize, or draw inferences about particular text items or textual features by having 
them engage in activities such as true/false, fill-in-the-blank, essay, or multiple choice tests. 
Typically, text evaluators using comprehension testing look for readers' abilities to make 
judgments and inferences about the text's content. As with other evaluation methods, the 
success and value of comprehension measures are directly related to the quality of the test 
itself. Poorly constructed questions are likely to produce trivial results. 

Besides the very familiar types of recall and recognition testing used in school 
settings and standardized test situations, other ways that comprehension is often assessed 
focus on summary, paraphrase, or inference measures. With these tests, participants are 
asked to read a text (or portions of it) and then to summarize or paraphrase the main ideas. 
Researchers are often interested in the number and importance of idea units recalled, the 
number and type of elaborations and integrations made, the number and kind of inferences 
drawn, and the number and type of errors made. Such tests are often very useful in 
pinpointing people's reactions to subtle cues in the text. 

For example, in evaluating how people understand texts such as unemployment 
compensation brochures and policy statements, writers have found it useful to study what 
people infer as they read. Such testing shows that people tend to draw elaborate (and often 
incorrect) inferences from statements about benefits that are made in such policies. 
Inference testing is likely to become a frequently used method in the 1990s, especially with 
so many companies worried over lawsuits related to the misunderstanding of written 
information [130, 131]. For instance, tampon companies have been trying to determine 
what they must do in creating warning labels and package inserts to limit their liability in 
cases of toxic shock syndrome. 
In assessing participants' performance on comprehension tests, evaluators typically 
use either criterion-referenced or norm-referenced approaches. Dick and Carey explain that 
the difference between these approaches lies in how test results are interpreted [132]. In 
criterion-referenced tests (sometimes called mastery tests), the performance of all 
participants is compared to a preestablished criterion for success. For example, in testing the 
effectiveness of a procedures manual for operating a computer, one might set a criterion 
that users must be able to perform the procedures with 85 percent accuracy. Thus, testing and 
revising would take place until all participants were able to meet the criterion using the text. 

On the other hand, norm-referenced testing compares the performance of participants 
with each other (either within a group or between groups). The participants' rank or 
position in the group becomes a reference point for determining the quality of performance 
rather than meeting a specified mastery level. Since many contexts for assessing text 
quality are ones in which it is impractical (and irrelevant) to set rigid criterion levels, 
norm-referenced testing is a useful alternative. For example, evaluators may want to know which 
of two texts is better for conveying detailed visual information: one with a full color 
photograph or one with a black and white line drawing. Similarly, evaluators may want to know 
which of several groups of readers responds most favorably to particular text features. For 
example, do experts retain more information from line drawings than do novices? The idea 
is to judge the relative quality of the text by looking at readers' performance in comparison 
to each other. 
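
The difference between the two interpretations can be made concrete with a minimal sketch. The participant labels, scores, and function names below are hypothetical and invented purely for illustration; only the 85 percent criterion comes from the example above. The sketch applies the same scores twice: once against a fixed mastery level, and once by comparing readers and texts with each other.

```python
from statistics import mean

# Hypothetical accuracy scores (fraction of procedure steps performed correctly).
# These numbers are invented for illustration; they are not data from any study.
scores_text_a = {"p1": 0.90, "p2": 0.80, "p3": 0.95}
scores_text_b = {"p1": 0.70, "p2": 0.85, "p3": 0.75}

CRITERION = 0.85  # the 85 percent mastery level from the example above


def meets_criterion(scores, criterion=CRITERION):
    """Criterion-referenced: every participant is compared to a fixed standard."""
    return all(score >= criterion for score in scores.values())


def failing_participants(scores, criterion=CRITERION):
    """Participants below the criterion; a non-empty list signals more revision."""
    return sorted(p for p, score in scores.items() if score < criterion)


def rank_participants(scores):
    """Norm-referenced (within a group): order participants by score, best first."""
    return sorted(scores, key=scores.get, reverse=True)


def better_text(scores_by_text):
    """Norm-referenced (between groups): the text whose readers score higher on average."""
    return max(scores_by_text, key=lambda name: mean(scores_by_text[name].values()))
```

Under the criterion-referenced reading, a text that most participants handle well still fails if even one reader falls short of the standard, so testing and revising continue. Under the norm-referenced reading, no standard is set at all; the same scores only say which reader or which text did better relative to the others.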

Surveys and interviews, perhaps the most commonly used methods for evaluating 
text quality, range from face-to-face procedures to pen and paper questionnaires to 
telephone and online surveys [133-137]. With surveys and interviews, participants 
typically respond to a mix of open-ended and close-ended questions designed to elicit 
opinions about the use of visual and verbal text features along dimensions such as 
comprehensibility and persuasiveness. The advantages of surveys and questionnaires are 
that they are relatively inexpensive, they can be self-administered, they do not require much 
time, and respondents can remain anonymous. (For a brief discussion of some of the types 
of survey scales, see Davis and Mentecki [138].) A major disadvantage is that quite often 
the participants are self-selected, thus biasing the results. From a reviser's perspective, 
surveys also have the drawback that if readers rate the text poorly, evaluators must conduct 
other tests to determine the particular text features or portions of text that caused problems 
for readers [139]. In addition, survey response rate may be low and participants often 
ignore some questions (especially open-ended questions that require time to respond). For 
a discussion of how surveys have been used in learning about writing in the workplace, see 
Anderson [140], and for some of the problems associated with survey research done in the 
field of technical communication, see Isakson and Spyridakis [141]. 

Interviews, on the other hand, do provide participants with the opportunity to discuss 
a text at length and allow the evaluator to probe individual responses in detail. See the 
U.S. Department of Health and Human Services' approach to conducting individual 
in-depth interviews or central location intercept interviews, that is, interviews conducted in 
locations frequented by representative members of a text's target audience [97]. For example, they 
describe a pilot maternal and child health care program in which interviewers went to 
several clinics in large metropolitan areas to talk with the intended audience of pregnant 
women and pretest a bilingual (Spanish/English) booklet on breast feeding [97, 17-18]. 
They point out that interviews are an extremely rich data source about how a text is 
working because people often feel more comfortable answering interview questions than 
objective test items. Disadvantages of interviews include that they are time-consuming to 
conduct and the data are often very difficult to analyze, thus making it hard to generalize 
from them. 
Focus groups, a method using group interview procedures for evaluation, have been a 
very popular means of pretesting the marketability of consumer products [142-145]. 
Focus groups use open-ended interviews to solicit people's attitudes, perceptions, and 
opinions about a single text or sometimes a group of texts, such as a new science textbook 
for a particular grade level or a new science textbook series for several grade levels of an 
entire school district. Focus groups in such a case could be helpful in discovering the 
kinds of text features teachers pay attention to when using a textbook and the range of 
factors that influence their choice of one text over another. (Unfortunately, up to this point 
most focus groups aiming to evaluate text quality are actually subject-matter expert 
interviews, in this case, interviews with "expert" teachers or school system 
administrators.) Although in this example the teachers are an important audience for 
judging text quality, it would be 
better to conduct the focus groups with the students who will be reading the science texts. 
See Markle [146] or Pepper [147] for a discussion of the value of using student feedback 
to improve instructional materials. 

Nonetheless, writers can use the kind of information generated by focus group 
discussions in planning and revising their texts. Under ideal circumstances, "the focus 
group presents a natural environment where participants are influencing and are influenced 
by others--just as they do in real life" [142, 30]. According to Krueger, focus groups 
have several distinct advantages: 

• It is a socially-oriented research method capturing real-life data in a social 
environment. 

• It has flexibility. 

• It has high face validity. 

• It has speedy results. 

• It is low in cost [142, 47]. 

But focus groups have limitations that affect the quality of the results: 

• Focus groups afford the researcher less control than individual interviews. 

• Data are difficult to analyze. 

• Moderators require special skills. 

• Differences between groups can be troublesome. 

• Groups are often difficult to assemble. 

• The discussion must be conducted in a conducive environment [142, 48]. 

Critical incidents, a method which asks participants to remember salient aspects of 
their interaction with a text, is designed to elicit readers' memories of positive or negative 
experiences using the text. For example, in one study involving computers and their 
accompanying documentation, the researcher asked participants to describe a positive or 
negative incident using the computer, to explain why the incident occurred, and then to rate 
the relevance and severity of 
the incident in terms of "How much does this factor matter to you?" A similar technique is 
called "storytelling"; participants are asked to tell the evaluator a narrative that reveals their 
attitudes and experiences related to text type or genre. Sometimes participants are provided 
with a scenario and are asked to complete the story, discussing how and when they might 
use the text under evaluation. A key drawback of these methods is that they place an 
enormous burden on memory and may predispose participants to exaggerate, thus not 
providing very accurate or reliable data. 

Another common retrospective test is the reader feedback card, which is usually found 
at the end of a book or an instructional guide. The idea is to gather perceptions about text 
quality by having readers fill in a series of close-ended and/or open-ended survey 
questions. Again, reader feedback cards have the inherent bias of self-selected participants, 
who tend to be lavish with praise or condemnation for a text. 

Summary 

Overall, retrospective testing can provide extremely useful data for revising text. 
Most researchers agree, however, that concurrent measures provide the most 
reliable data. For this reason, retrospective methods should be used in conjunction with 
concurrent methods for greater reliability. 



CONCLUSION 

Earlier I argued that an optimal text-evaluation method should provide writers with 
two sorts of information: (1) information about whole-text or global aspects of text quality, 
and (2) information about how the audience may respond to the text. Clearly, research and 
experience show us that reader-focused testing methods have the advantage on both counts. 
When practical considerations such as time and expense allow, reader-focused methods are 
preferable to text-focused and expert-judgment-focused methods because they shift the 
primary job of representing the text's problems from the writer or expert to the reader. 
Thus, reader-focused methods help minimize the chances of failing to detect problems. In 
addition, reader-focused methods expand the scope of text problems that get noticed, 
shifting the evaluator's attention to global problems, especially problems of visual and 
verbal omissions. Most writers and readers would agree that perhaps the biggest problem 
with poorly written text lies not in what it says but in what it fails to say. Overall, 
reader-focused methods such as protocol-aided revision can help writers achieve a better model of 
readers actively engaged in meaning construction. Such a model of readers is helpful not 
only in revising the text under evaluation, but in planning and revising future text. 